We carried out a comprehensive evaluation of 13 recent models for ranking long documents using two popular collections (MS MARCO Documents and Robust04). Our model zoo includes two specialized Transformer models (e.g., Longformer) that can process long documents without splitting them. Along the way, we document several difficulties in training and comparing such models. Somewhat surprisingly, we find that the simple FirstP baseline (truncating documents to satisfy the input-sequence constraint of a typical Transformer model) is quite effective. We analyze the distribution of relevant passages within documents to explain this phenomenon. We further argue that, despite their widespread use, Robust04 and MS MARCO Documents are not particularly useful for benchmarking long-document models.
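To make the FirstP idea concrete, here is a minimal sketch of such a truncation baseline, assuming a Hugging Face cross-encoder; the checkpoint name and the 512-token budget are illustrative, not the paper's exact setup:

```python
# A FirstP-style baseline: score a (query, document) pair after truncating the
# document so the pair fits the encoder's input budget. The checkpoint name is
# illustrative; any cross-encoder-style ranking model works the same way.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def firstp_score(query: str, document: str, max_length: int = 512) -> float:
    # truncation="only_second" keeps the full query and cuts the document,
    # so the model only ever sees the document's first passage.
    inputs = tokenizer(query, document, truncation="only_second",
                       max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```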
Because annotation costs are high, making the best use of existing human-created training data is an important research direction. We therefore carried out a systematic evaluation of the transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from one large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves the reliability of the results. Furthermore, because source-dataset licenses often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels, possibly followed by fine-tuning on a modest number of annotated queries, can produce models that are competitive with or better than transfer learning. However, the stability and/or effectiveness of few-shot training still needs to be improved, as it can sometimes degrade the performance of a pre-trained model.
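The pseudo-labeling setup can be sketched as follows; this assumes the rank_bm25 package, and the positive/negative sampling heuristic is an illustration rather than the paper's exact recipe:

```python
# Pseudo-label generation with BM25: the top-ranked document for a query is
# treated as a positive, and a lower-ranked document as a hard negative; the
# resulting triples train a neural ranker. The sampling window is a heuristic.
import random
from rank_bm25 import BM25Okapi

def bm25_pseudo_labels(queries, corpus, neg_window=(10, 100)):
    tokenized = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)
    triples = []
    for q in queries:
        ranked = bm25.get_scores(q.lower().split()).argsort()[::-1]
        positive = corpus[ranked[0]]                      # pseudo-positive
        lo, hi = neg_window
        hi = min(hi, len(ranked))                         # small-corpus guard
        negative = corpus[random.choice(ranked[lo:hi])]   # pseudo-negative
        triples.append((q, positive, negative))
    return triples
```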
Despite significant progress in object categorization in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with class vocabularies of limited size and typically requires separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above-mentioned challenges and address problems of supervised, zero-shot, generalized zero-shot, and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that the resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with a class vocabulary of up to 310K on the Animal with Attributes and ImageNet datasets.
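A minimal PyTorch sketch of the distance constraint follows; variable names, the margin value, and the absence of per-constraint weights are simplifications of the paper's weighted objective:

```python
# Hinge-loss version of the constraint described above: an embedded sample
# must lie closer to its own class prototype than to any other vocabulary
# atom, by a margin. The full objective adds weighting and further terms.
import torch

def vocab_margin_loss(embed, prototypes, labels, margin=0.1):
    # embed: (B, D) projected samples; prototypes: (V, D) vocabulary atoms
    dists = torch.cdist(embed, prototypes)            # (B, V) pairwise distances
    pos = dists.gather(1, labels.unsqueeze(1))        # distance to true prototype
    mask = torch.ones_like(dists).scatter_(1, labels.unsqueeze(1), 0.0)
    # hinge: true-class distance + margin must beat every other distance
    violations = torch.relu(pos + margin - dists) * mask
    return violations.sum(dim=1).mean()
```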
Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of "task prompts", each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as "Promptonomy", since the prompts model a task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the "Promptonomy" approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets.
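As a rough illustration of the task-prompt mechanism (not the PViT implementation), the following PyTorch sketch appends one learnable token per auxiliary task to the video tokens and reads each one out with its own head; dimensions and task names are assumptions:

```python
# One learnable prompt per auxiliary task is processed jointly with the video
# tokens by a shared transformer; each prompt's output slot feeds its own head.
import torch
import torch.nn as nn

class TaskPromptTransformer(nn.Module):
    def __init__(self, dim=768, tasks=("depth", "objects", "action"), depth=4):
        super().__init__()
        self.prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(1, 1, dim) * 0.02) for t in tasks})
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def forward(self, video_tokens):                  # (B, N, dim)
        B = video_tokens.size(0)
        names = list(self.prompts)
        prompts = torch.cat([self.prompts[t].expand(B, 1, -1) for t in names], 1)
        out = self.backbone(torch.cat([prompts, video_tokens], dim=1))
        # read each prompt's contextualized slot with its task head
        return {t: self.heads[t](out[:, i]) for i, t in enumerate(names)}
```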
Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such pixel-to-region associations and the resulting representation, which often take the form of attention, cannot model the underlying semantic structure of the scene (e.g., individual objects and, by extension, their interactions). In this work, we take a step toward addressing this limitation. Specifically, we propose an architecture where we learn to project image features into latent region representations and perform global reasoning across them, using a transformer, to produce contextualized and scene-consistent representations that are then fused with original pixel-level features. Our design enables the latent regions to represent semantically meaningful concepts, by ensuring that activated regions are spatially disjoint and unions of such regions correspond to connected object segments. The resulting semantic global reasoning (SGR) is end-to-end trainable and can be combined with any semantic segmentation framework and backbone. Combining SGR with DeepLabV3 results in a semantic segmentation performance that is competitive to the state-of-the-art, while resulting in more semantically interpretable and diverse region representations, which we show can effectively transfer to detection and instance segmentation. Further, we propose a new metric that allows us to measure the semantics of representations at both the object class and instance level.
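The pixel-to-region-and-back pattern might look roughly like the following PyTorch sketch; it omits the paper's constraints that make regions spatially disjoint and object-aligned, and all sizes are illustrative:

```python
# Pixel features are softly assigned to a few latent regions, a transformer
# reasons over the region vectors, and the result is broadcast back to pixels
# and fused with the original features.
import torch
import torch.nn as nn

class LatentRegionReasoning(nn.Module):
    def __init__(self, dim=256, num_regions=8):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_regions, kernel_size=1)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.reason = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, feats):                              # (B, C, H, W)
        B, C, H, W = feats.shape
        attn = self.assign(feats).flatten(2).softmax(dim=-1)   # (B, R, HW)
        flat = feats.flatten(2).transpose(1, 2)                # (B, HW, C)
        regions = self.reason(attn @ flat)                     # (B, R, C)
        ctx = attn.transpose(1, 2) @ regions                   # (B, HW, C)
        ctx = ctx.transpose(1, 2).reshape(B, C, H, W)
        return self.fuse(torch.cat([feats, ctx], dim=1))
```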
Quantum machine learning has become an area of growing interest but has certain theoretical and hardware-specific limitations. Notably, the problem of vanishing gradients, or barren plateaus, renders training impossible for circuits with high qubit counts, imposing a limit on the number of qubits that data scientists can use for solving problems. Independently, angle-embedded supervised quantum neural networks were shown to produce truncated Fourier series with a degree that depends directly on two factors: the depth of the encoding and the number of parallel qubits the encoding is applied to. The degree of the Fourier series limits the model's expressivity. This work introduces two new architectures whose Fourier degrees grow exponentially: the sequential and parallel exponential quantum machine learning architectures. This is achieved by using the available Hilbert space efficiently during encoding, which increases the expressivity of the quantum encoding. The exponential growth therefore makes it possible to stay in the low-qubit regime while creating highly expressive circuits that avoid barren plateaus. In practice, the parallel exponential architecture was shown to outperform existing linear architectures, reducing the final mean squared error by up to 44.7% on a one-dimensional test problem. Furthermore, the feasibility of this technique was also demonstrated on a trapped-ion quantum processing unit.
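A toy PennyLane sketch of the sequential variant of this idea follows: re-uploading the input with rotation angles scaled by powers of 3 lets the accessible Fourier frequencies grow exponentially with the number of encoding repetitions, rather than linearly as with unscaled angle encoding. The gate choice and scaling base are assumptions, not the paper's exact circuit:

```python
# Sequential exponential encoding on a single qubit: the data rotation in
# layer k is scaled by 3**k, so the reachable Fourier spectrum grows
# exponentially in the number of layers.
import pennylane as qml
from pennylane import numpy as np

n_layers = 3
dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev)
def exponential_encoding(x, weights):
    for k in range(n_layers):
        qml.RY(weights[k], wires=0)        # trainable rotation
        qml.RZ((3 ** k) * x, wires=0)      # data encoding scaled by 3**k
    qml.RY(weights[n_layers], wires=0)     # final trainable rotation
    return qml.expval(qml.PauliZ(0))

weights = np.random.uniform(0, np.pi, n_layers + 1)
print(exponential_encoding(0.5, weights))
```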
Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, in this paper we study neural architecture search (NAS) through the lens of learning random graph models. In contrast to existing NAS methods, which largely focus on searching for a single best architecture, i.e., point estimation, we propose GraphPNAS, a deep graph generative model that learns a distribution of well-performing architectures. Relying on graph neural networks (GNNs), our GraphPNAS can better capture topologies of good neural architectures and relations between operators therein. Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods. Finally, we learn our generator via an efficient reinforcement learning formulation for NAS. To assess the effectiveness of our GraphPNAS, we conduct extensive experiments on three search spaces, including the challenging RandWire on TinyImageNet, ENAS on CIFAR10, and NAS-Bench-101/201. The complexity of RandWire is significantly larger than that of other search spaces in the literature. We show that our proposed graph generator consistently outperforms the RNN-based one and achieves performance better than or comparable to state-of-the-art NAS methods.
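The reinforcement-learning formulation can be caricatured as follows; this bare-bones REINFORCE sketch uses independent edge probabilities over a DAG, whereas the actual GraphPNAS generator is a GNN, and the evaluator here is a stand-in for training and validating the sampled network:

```python
# Learning a distribution over architectures with REINFORCE: edge logits of a
# DAG parameterize the generator, sampled graphs are scored by an (assumed)
# evaluator, and log-probabilities are reweighted by the reward.
import torch

n_nodes = 7
logits = torch.zeros(n_nodes, n_nodes, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
mask = torch.triu(torch.ones(n_nodes, n_nodes), diagonal=1)  # DAG constraint

def evaluate(adj):                 # placeholder for training + validating a net
    return adj.mean()              # stands in for validation accuracy

for step in range(100):
    probs = torch.sigmoid(logits) * mask
    adj = torch.bernoulli(probs)                       # sample an architecture
    reward = evaluate(adj)
    log_p = (adj * torch.log(probs + 1e-9)
             + (1 - adj) * torch.log(1 - probs + 1e-9)) * mask
    loss = -(reward.detach() * log_p.sum())            # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```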
There has been a recent explosion of impressive generative models that can produce high-quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditioning sentences that contain unambiguous descriptions of scenes and their main actors. Employing such models for the more complex task of story visualization therefore remains a challenge: references and co-references occur naturally, and one must reason about when to maintain consistency of actors and backgrounds across frames/scenes, and when not to, based on the story's progression. In this work, we address the aforementioned challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds, and referencing in multi-sentence storylines. Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
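The sentence-conditioned soft attention over the visual memory might be sketched as follows; the shapes, the single-query design, and how the attended context would condition the diffusion step are all assumptions:

```python
# The current sentence embedding queries a memory of previously generated
# frame features; the attended context is what would condition the next
# generation step.
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, sentence, memory):   # (B, dim), (B, T, dim)
        q = self.q(sentence).unsqueeze(1)                    # (B, 1, dim)
        attn = q @ self.k(memory).transpose(1, 2)            # (B, 1, T)
        attn = (attn / memory.size(-1) ** 0.5).softmax(dim=-1)
        return (attn @ self.v(memory)).squeeze(1)            # (B, dim)
```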
Since their inception, vision-language models trained on large, randomly collected data have had a major impact in many areas. However, although they perform well across a variety of domains such as image-text retrieval, their inner workings remain poorly understood. The present work analyzes the true zero-shot capabilities of these models. We start with an analysis of the pre-training corpus, assessing to what extent (and which of) the test classes are actually zero-shot, and how this correlates with per-class performance. We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, to assess how well this classical notion of zero-shot learning emerges from large-scale supervision. We leverage the recently released LAION400M data corpus together with the publicly available CLIP, OpenCLIP, and FLAVA models, and evaluate attribute-based zero-shot capabilities on the CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popular zero-shot benchmarks are observed (many times) during pre-training; (ii) zero-shot performance stems mostly from the models' ability to recognize class labels whenever they are present in the text, and markedly lower performance is observed for attribute-based zero-shot learning once class labels are not used; (iii) the number of attributes used can have a significant effect on performance and can easily lead to a large drop.
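The corpus-analysis step, counting how often each test class name occurs in the pre-training captions, can be sketched as below; a corpus like LAION400M would need streaming in practice, and robust class-name matching needs more care than this word-boundary check:

```python
# Count caption-level occurrences of each class name; counts near zero flag
# the classes that are genuinely zero-shot for the model under study.
import re
from collections import Counter

def class_frequencies(captions, class_names):
    counts = Counter()
    patterns = {c: re.compile(r"\b" + re.escape(c.lower()) + r"\b")
                for c in class_names}
    for caption in captions:
        text = caption.lower()
        for name, pat in patterns.items():
            if pat.search(text):
                counts[name] += 1
    return counts
```

Correlating these counts with per-class accuracy is then a straightforward second step, e.g., a rank correlation between frequency and accuracy.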
As of today, the best accuracy in line segment detection (LSD) is achieved by algorithms based on convolutional neural networks (CNNs). Unfortunately, these methods employ deep, heavy networks and are slower than traditional model-based detectors. In this paper, we build an accurate yet fast CNN-based detector, LSDNet, by incorporating a lightweight CNN into the classical LSD detector. Specifically, we replace the first step of the original LSD algorithm (the construction of a line segment heatmap and tangent field) with a lightweight CNN that can compute more complex and richer features. The second part of the LSD algorithm is used with only minor modifications. Compared with several modern line segment detectors on the standard Wireframe dataset, the proposed LSDNet provides the highest speed among CNN-based detectors, 214 FPS, with a competitive accuracy of 78 Fh. Although the best reported accuracy is 83 Fh at 33 FPS, we speculate that the observed accuracy gap is caused by annotation errors and that the actual gap is significantly lower. We point out systematic inconsistencies in the annotations of the popular line detection benchmarks, Wireframe and York Urban, carefully re-annotate a subset of the images, and show that (i) existing detectors improve in quality on the re-annotated subset without any retraining, indicating that the new annotations correlate better with the correct notion of line segment detection; (ii) the gap between the accuracy of our detector and the others shrinks to a negligible 0.2 Fh, while our method remains the fastest.
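A two-head network matching this description might look like the sketch below; the trunk, channel counts, and output parameterization (probability heatmap plus a two-channel tangent field) are illustrative guesses, not the LSDNet architecture:

```python
# A small shared trunk predicts a line-segment heatmap and a tangent field,
# replacing the gradient-based first stage of classical LSD.
import torch
import torch.nn as nn

class LightweightLSDFrontend(nn.Module):
    def __init__(self, width=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True))
        self.heatmap = nn.Conv2d(width, 1, 1)    # line-segment probability
        self.tangent = nn.Conv2d(width, 2, 1)    # tangent direction (cos, sin)

    def forward(self, image):                    # (B, 3, H, W)
        feats = self.trunk(image)
        return torch.sigmoid(self.heatmap(feats)), self.tangent(feats)
```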